Unsupervised Acquiring of Morphological Paradigms from Tokenized Text
نویسنده
چکیده
This paper describes a rather simplistic method of unsupervised morphological analysis of words in an unknown language. All what is needed is a raw text corpus in the given language. The algorithm looks at words, identifies repeatedly occurring stems and suffixes, and constructs probable morphological paradigms. The paper also describes how this method has been applied to solve the Morpho Challenge 2007 task, and gives the Morpho Challenge results. Although the present work was originally a student project without any connection or even knowledge of related work, its simple approach outperformed, to our surprise, several others in most morpheme segmentation subcompetitions. We believe that there is enough room for improvements that can put the results even higher. Errors are discussed in the paper; together with suggested adjustments in future research.
منابع مشابه
Resource-Light Acquisition of Inflectional Paradigms
This paper presents a resource-light acquisition of morphological paradigms and lexicon for fusional languages. It builds upon Paramor [10], an unsupervised system, by extending it: (1) to accept a small seed of manually provided word inflections with marked morpheme boundary; (2) to handle basic allomorphic changes acquiring the rules from the seed and/or from previously acquired paradigms. Th...
متن کاملMorphological Paradigms: Computational Structure and Unsupervised Learning
This thesis explores the computational structure of morphological paradigms from the perspective of unsupervised learning. Three topics are studied: (i) stem identification, (ii) paradigmatic similarity, and (iii) paradigm induction. All the three topics progress in terms of the scope of data in question. The first and second topics explore structure when morphological paradigms are given, firs...
متن کاملElephant: Sequence Labeling for Word and Sentence Segmentation
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific. We show that highaccuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method ...
متن کاملTree Structured Dirichlet Processes for Hierarchical Morphological Segmentation
This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process (HDP). Tree hierarchies are learned alon...
متن کاملNetworks of morphological relations
There has been a considerable amount of research on computerized methods of inducing morphology. The majority of these efforts have lately been directed toward “unsupervised” techniques, which in this context means that some morphological knowledge is induced from raw text (Hammarström and Borin 2011). “Morphological knowledge” is usually taken to mean something that a linguistic analysis would...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007